Introduction to Computational Social Sciences

Malo Jan & Luis Sattelmayer

2025-01-12

Course Introduction

About us

  • Malo Jan
    • 3rd year PhD student at the CEE working on Climate politics, Legislative politics, Party politics, Quantitative and computational methods
    • malo.jan@sciencespo.fr
  • Luis Sattelmayer
    • 3rd year PhD student at the CEE working on Party Competition, Partisan Strategies, Immigration Politics, Quantitative and computational methods
    • luis.sattelmayer@sciencespo.fr

Goal of the course

  • Introduction to Computational Social Sciences
  • From data collection to analysis
  • Mostly about text analysis
  • Exposure to different methods, tools, and their applications
  • Prerequisites: solid command of RStudio
  • Goal is to provide a non-technical introduction, focus on use cases

Outline of the course

Day Session 1 Session 2 Session 3
1 Intro to CSS + Corpus selection Web scraping in R Coding session
2 Manipulating textual data From text to numbers Coding session
3 Supervised machine learning (SML) for text classification SML in R Coding session
4 Unsupervised machine learning (UML) UML in R Coding session
5 Large language models (LLMs) LLMs in Python Open

Course material

  • Slides and code will be available on the course’s Github repository
  • If you are not familiar with Git and Github, have a look at this tutorial

Computational Social Sciences

What are CSS

Computational social science is an interdisciplinary field that advances theories of human behavior by applying computational techniques to large datasets from social media sites, the Internet, or other digitized archives such as administrative records. Edelmann et al. (2020)

  • Institutionalization of CSS
    • Specific journals : eg. JCSS
    • Networks : eg. SICCS
    • Positions : Assistant professor in CSS, postdocs…

More data, computing power and methods

CSS from two main developments :

  • New and more data
    • Data on the web : digitization of text, historical archives, administrative records, open data
    • Data from the web : social media, websites
    • Unstructured data : text, images, videos, not produced by researchers
  • New methods and computing power
    • More powerful computers and hardware
    • More powerful algorithms, methods, models, IA

More data, computing power and methods

  • New data sources
  • Unstructured data: not produced by researchers (such as survey, qualitative interviews)
  • Collecting population rather than sample
  • Allows to train more complex models
  • Development of new algorithms, models in Natural Language Processing
  • Rise of LLMs
  • GPU

Developments in political science

  • Early development of computational text analysis in the 2000’s in political science : convert text to numbers to perform statistical analysis
  • But really booming over the last few years with advances in AI allowing to perform more complex analysis
  • Even more recent developments :
    • Muiltilingual text analysis
    • Image, Video as data, multimodal analysis
    • Generative language

Six principles for CSS (Grimmer, Roberts, and Stewart 2022)

  1. Social science theories and substantive knowledge are essential for research design.
  2. Text analysis does not replace humans – it augments them (see also the great paper by Do, Ollion, and Shen (2024))
  3. Building, refining, and testing social science theories requires iteration and cumulation.
  4. Text analysis methods distill generalizations from language.
  5. The best method depends on the task.
  6. Validations are essential and depend on the theory and the task.

What are textual statistics?

  • Capture latent meaning in text
    • Meaning can be discovered through the method (inductively – unsupervised)
    • Or we can use a more deductive reasoning beforehand to nudge our method toward a specific underlying pattern we are interested in (supervised)
  • This means that we are more or less informed about the latent dimensionality(ies) of our texts
  • It is also a mathematization of language
    • written or not, there is no difference as long as the text is available in a form that is readable for computers

Representing text-as-data

  • Analyzing text-as-data involves transforming text into a numeric representation in order to use statistical models (clustering, scaling, topic models, supervised machine learning etc.)
  • Featurization: extract numeric features from raw text
  • Three main steps in text analysis:
    1. Bag of word representation
    2. Static word embeddings
    3. Contextual embeddings from Large Language Models (LLMs)

The best method depends on the task

Why use these methods?

  • Counter issues of data availability
  • Uncover patterns/dimensionalities in text that humans do not recognize as such
  • Save time (scalability of methods)
  • Albeit being a quantitative method, methods covered in this class are also useful for qualitative research
    • collecting documents, corpora
    • organizing and structuring data
    • uncovering patterns inductively

Social group detection in party manifestos

Licht and Sczepanksi (2024)

The meaning of “class” in books

## Immigration stance over time in Germany

Bonikowski, Luo, and Stuhler (2022) Measuring Populism

References

Bonikowski, Bart, Yuchen Luo, and Oscar Stuhler. 2022. “Politics as Usual? Measuring Populism, Nationalism, and Authoritarianism in US Presidential Campaigns (1952–2020) with Neural Language Models.” Sociological Methods & Research 51 (4): 1721–87.
Do, Salomé, Étienne Ollion, and Rubing Shen. 2024. “The Augmented Social Scientist: Using Sequential Transfer Learning to Annotate Millions of Texts with Human-Level Accuracy.” Sociological Methods & Research 53 (3): 1167–1200.
Edelmann, Achim, Tom Wolff, Danielle Montagne, and Christopher A Bail. 2020. “Computational Social Science and Sociology.” Annual Review of Sociology 46 (1): 61–81.
Grimmer, Justin, Margaret E Roberts, and Brandon M Stewart. 2022. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.
Licht, Hauke, and Ronja Sczepanksi. 2024. “Who Are They Talking about? Detecting Mentions of Social Groups in Political Texts with Supervised Learning.” ECONtribute Discussion Paper.